Add amdgpu intrinsics #1976
Conversation
I think it can be "tested" the same way we do …
Force-pushed from d6043df to e102811.
Thanks, I tried to replicate what’s there for nvptx, plus adding …. The diff to my first push from when opening this PR is here: https://github.com/rust-lang/stdarch/compare/b3f5bdae0efbdd5f7297d0225623bd31c7fe895b..e1028110e77561574bfb7ea349154d46b5ea7b86
sayantn
left a comment
Thanks, I have left some comments.

Also, I have noticed that Clang provides a different set of intrinsics than these (…).
Thanks for the review! Interesting, I thought …

I do plan to write a common `gpu` module, roughly:

```rust
mod amdgpu {
    mod gpu {
        use super::*;
        pub fn block_id_x() -> u32 {
            workitem_id_x()
        }
    }
}
// Same for nvptx

mod gpu {
    #[cfg(target_arch = "amdgpu")]
    pub use amdgpu::gpu::*;
    #[cfg(target_arch = "nvptx64")]
    pub use nvptx::gpu::*;
    // + more intrinsics as in gpuintrin.h
}
```
The analogue to …

If there are some interesting platform-specific intrinsics, we can add them, but generally we follow GCC and Clang in regards to intrinsics. Yeah, a common …
Add intrinsics for the amdgpu architecture.
Force-pushed from e102811 to d850408.
Thanks for the review, that’s a nice case for const generics! I fixed all your inline comments. Also thanks for linking the clang intrinsics, I tend to forget about those.

As common GPU intrinsics are a larger topic, I started a Discourse post here (please add your thoughts!): https://internals.rust-lang.org/t/naming-gpu-things-in-the-rust-compiler-and-standard-library/23833

My planned intrinsic “stack” is roughly as follows, from high-level to low-level: …
We could also use clang names for …

I would prefer following LLVM’s intrinsic names in ….

Note for my future self: I got run.sh to work with …
sayantn
left a comment
Thanks, left some more comments
```rust
//! [LLVM implementation]: https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
#[allow(improper_ctypes)]
unsafe extern "unadjusted" {
```
Is there some specific reason you only made some of the intrinsics (both the LLVM declarations and the Rust wrappers) safe? Normally we make an intrinsic unsafe only if it can cause memory unsafety, and I don't see how something like `s.barrier.signal` is unsafe while `s.barrier` is safe.
Yes, I tried to follow the Rust safe/unsafe principles.

Every intrinsic that is marked safe will not cause undefined behavior or memory safety issues, no matter which arguments it is called with. They could still cause other issues, like deadlocks if `s.barrier` is used incorrectly, but this is still “safe” in the Rust sense.

Intrinsics that are unsafe can cause undefined behavior, either in the compiler/LLVM or in the hardware, when called with incorrect arguments. E.g. `readlane` or `permlane` can read registers from inactive threads (which I guess is either directly UB in LLVM or poison, which can cause UB further down the line).

For `s.barrier.signal`, I’m not sure what the allowed `BARRIER_TYPE`s are. What happens when passing an invalid number for the type seems to be currently unspecified (the documentation says “only for non-named barriers”) and could very well become undefined behavior with a future LLVM change.
```rust
    fn llvm_s_barrier_leave(barrier_type: u16);
    #[link_name = "llvm.amdgcn.s.get.barrier.state"]
    fn llvm_s_get_barrier_state(barrier_type: i32) -> u32;
    #[link_name = "llvm.amdgcn.s.wave.barrier"]
```
```diff
-#[link_name = "llvm.amdgcn.s.wave.barrier"]
+#[link_name = "llvm.amdgcn.wave.barrier"]
```

Typo (probably also rename the intrinsics to `(llvm_)wave_barrier`).
```rust
    safe fn llvm_wave_reduce_add(value: u32, strategy: u32) -> u32;
    #[link_name = "llvm.amdgcn.wave.reduce.fadd"]
    safe fn llvm_wave_reduce_fadd(value: f32, strategy: u32) -> f32;
    #[link_name = "llvm.amdgcn.wave.reduce.and"]
```
Q: I see that LLVM also defines `fsub` and `sub` versions of these intrinsics; why aren't they included here?
I couldn’t make much sense of the sub variants 😄
The reduce intrinsics do some accumulation/reduction over the 32 or 64 threads in a wave.
This can be neatly implemented as a parallel reduce (i.e. a binary tree where first thread 0+1→0, 2+3→2, … are added, then 0+2, 4+6, etc.; LLVM currently doesn’t do this… but that’s a different problem).
Doing an add reduce of multiple values (over a wave here) makes sense to me. add is an associative operation.
Doing a sub doesn’t make sense to me, because it’s not associative. The order of operations matters, so what is a parallel sub reduce supposed to do? (Also, is the very first value kept positive or subtracted from 0?)
I looked into the LLVM implementation; it seems to do a sum (add reduce) and then multiply by -1. If someone needed that functionality, I think it would be much clearer to write exactly that: `reduce.add * -1`. It would also be as efficient as the `reduce.sub` intrinsic, since the implementation doesn’t do anything different.
```rust
    safe fn llvm_s_get_waveid_in_workgroup() -> u32;

    // gfx11
    #[link_name = "llvm.amdgcn.permlane64"]
```
```diff
-#[link_name = "llvm.amdgcn.permlane64"]
+#[link_name = "llvm.amdgcn.permlane64.i32"]
```
the type parameter should be specified
```rust
#[inline]
#[unstable(feature = "stdarch_amdgpu", issue = "149988")]
pub unsafe fn sched_barrier<const MASK: u32>() {
    static_assert_uimm_bits!(MASK, 11);
```
Is this check enough? It seems like MASK can only take the values mentioned above, not any linear combination of them; in that case we should also check `MASK.is_power_of_two()` or something like that. Also, nit: it might be better to make some constants for these values, but I'll leave that to your discretion.
I see it used like a proper bitmask here (15/0xf), so I think it’s fine, but thanks for checking!
https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sched.barrier.ll#L16
For some reason when I tried to build this with all #[inline]s replaced by #[inline(never)] (so that the functions actually codegen), I get a segfault for permlane64_u32, and LLVM selection errors for s.get.waveid.in.workgroup and wave.id. This is probably due to missing target features or something similar, I noticed that you made comments that these are only available in gfx12 and higher (probably, my knowledge of GPUs is nonexistent), but we are building with target-cpu=gfx900.
Ah yes, I can reproduce this.
There are quite a few “target-cpu”s (gfx numbers). Every generation of GPUs adds four or more of them.
E.g. for the RX 90xx series, there’s gfx1200 (RX 9060 (XT)), gfx1201 (RX 9070 (XT)), gfx1250 and gfx1251 (for APUs).
The full list is available here: https://llvm.org/docs/AMDGPUUsage.html#processors
The HPC/datacenter cards (MI 100, MI 250, … in case you’ve heard the names) are building upon gfx9, so new releases are gfx90a, gfx942, etc.
The amdgpu LLVM backend has about one target feature per intrinsic and a large list of target features for every generation and variant.
I’m hesitant to introduce such things as target-features to Rust as it quickly becomes a maintenance burden.
I don't have any strong preferences. If you feel that the lower-level LLVM intrinsics are more useful than just having the clang intrinsics, go ahead with that; we can change it, remove it, or add the clang intrinsics too if required (as long as they are unstable).
Add intrinsics for the amdgpu architecture.
I’m not sure how to add/run CI (`ci/run.sh` fails for me, e.g. for nvptx, because `core` cannot be found), but I checked that it compiles without warnings with ….
Tracking issue: rust-lang/rust#149988